
    Localizing the Latent Structure Canonical Uncertainty: Entropy Profiles for Hidden Markov Models

    This report addresses state inference for hidden Markov models. These models rely on unobserved states, which often have a meaningful interpretation, so diagnostic tools for quantifying state uncertainty are needed. The entropy of the state sequence that explains an observed sequence for a given hidden Markov chain model can be considered the canonical measure of state sequence uncertainty. This canonical measure is not reflected by the classic multivariate state profiles computed by the smoothing algorithm, which summarize the possible state sequences. Here, we introduce a new type of profile with the following properties: (i) these profiles of conditional entropies decompose the canonical measure of state sequence uncertainty along the sequence and make it possible to localize this uncertainty; (ii) these profiles are univariate and thus remain easily interpretable on tree structures. We show how to extend the smoothing algorithms for hidden Markov chain and tree models to compute these entropy profiles efficiently. Comment: Submitted to Journal of Machine Learning Research; No RR-7896 (2012)
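    For readers who want the decomposition in property (i) made explicit, it can be sketched with standard chain-rule notation (the symbols below are ours, not the paper's: S_{1:T} denotes the hidden state sequence and X_{1:T} the observations):

        % Chain rule for entropy, applied to the posterior state distribution
        H(S_{1:T} \mid X_{1:T}) = \sum_{t=1}^{T} H(S_t \mid S_{1:t-1}, X_{1:T})
        % The posterior state process of a hidden Markov chain is itself Markovian,
        % so each term conditions only on the previous state:
                                = H(S_1 \mid X_{1:T}) + \sum_{t=2}^{T} H(S_t \mid S_{t-1}, X_{1:T})

    Each summand is a per-position, univariate contribution that can be read off along the sequence, which is what lets the profile localize the global state-sequence uncertainty.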

    Improved model identification for non-linear systems using a random subsampling and multifold modelling (RSMM) approach

    In non-linear system identification, the available observed data are conventionally partitioned into two parts: the training data used for model identification and the test data used for model performance testing. This sort of 'hold-out' or 'split-sample' data partitioning is convenient, and the associated model identification procedure is in general easy to implement. The resultant model obtained from such a once-partitioned single training dataset, however, may occasionally lack the robustness and generalisation needed to represent future unseen data, because the performance of the identified model may depend heavily on how the data partition is made. To overcome this drawback of the hold-out method, this study presents a new random subsampling and multifold modelling (RSMM) approach to produce less biased or preferably unbiased models. The basic idea and the associated procedure are as follows. First, generate K training datasets (and also K validation datasets) using a K-fold random subsampling method. Second, detect significant model terms and identify a common model structure that fits all K datasets using a newly proposed common model selection approach, called the multiple orthogonal search algorithm. Finally, estimate and refine the model parameters for the identified common-structured model using a multifold parameter estimation method. The proposed method can produce robust models with better generalisation performance.
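    As a rough illustration of the data-partitioning step only (the function and parameter names below are ours, not the paper's, and the train/validation proportion is an arbitrary placeholder), the K-fold random subsampling could look like:

        import numpy as np

        def random_subsampling(n_samples, k=10, train_fraction=0.8, seed=0):
            """Generate K independent random train/validation splits (illustrative sketch)."""
            rng = np.random.default_rng(seed)
            splits = []
            for _ in range(k):
                perm = rng.permutation(n_samples)        # random reordering of sample indices
                cut = int(train_fraction * n_samples)    # size of the training part
                splits.append((perm[:cut], perm[cut:]))  # (training indices, validation indices)
            return splits

        # Each of the K splits would then feed the common structure selection and
        # multifold parameter estimation stages described in the abstract.
        for train_idx, val_idx in random_subsampling(200, k=5):
            pass  # identify candidate model terms on train_idx, validate on val_idx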

    sCompile: Critical path identification and analysis for smart contracts

    Ethereum smart contracts are an innovation built on top of blockchain technology, which provides a platform for automatically executing contracts in an anonymous, distributed, and trusted way. Like ordinary programs, smart contracts may contain vulnerabilities; the problem is magnified by the fact that, unlike ordinary programs, smart contracts cannot be patched easily once deployed. It is therefore important for smart contracts to be checked against potential vulnerabilities. In this work, we propose an alternative approach that automatically identifies critical program paths (with multiple function calls, including inter-contract function calls) in a smart contract, ranks the paths according to their criticalness, discards them if they are infeasible, or otherwise presents them with user-friendly warnings for user inspection. We identify paths which involve monetary transactions as critical paths, and prioritize those which potentially violate important properties. For scalability, symbolic execution techniques are only applied to the top-ranked critical paths. Our approach has been implemented in a tool called sCompile, which has been applied to 36,099 smart contracts. The experimental results show that sCompile is efficient, taking 5 seconds on average per smart contract. Furthermore, we show that many known vulnerabilities can be captured if a user inspects as few as 10 program paths generated by sCompile. Lastly, sCompile discovered 224 unknown vulnerabilities with a false positive rate of 15.4% before user inspection. Comment: Accepted by ICFEM 201
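    The rank-then-symbolically-execute idea can be pictured with a toy sketch (the scoring heuristic, field names, and default k below are illustrative assumptions, not sCompile's actual criteria):

        from dataclasses import dataclass

        @dataclass
        class ProgramPath:
            calls: list            # sequence of (possibly inter-contract) function calls
            moves_ether: bool      # does the path involve a monetary transaction?
            violations: int        # number of important properties the path may violate

        def criticalness(path):
            # Toy heuristic: monetary paths are critical, and paths that potentially
            # violate more properties rank higher.
            return (10.0 if path.moves_ether else 0.0) + path.violations

        def top_critical_paths(paths, k=10):
            """Rank paths and keep only the top-k for (expensive) symbolic execution."""
            return sorted(paths, key=criticalness, reverse=True)[:k]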

    A specialized learner for inferring structured cis-regulatory modules

    BACKGROUND: The process of transcription is controlled by systems of transcription factors, which bind to specific patterns of binding sites in the transcriptional control regions of genes, called cis-regulatory modules (CRMs). We present an expressive and easily comprehensible CRM representation which is capable of capturing several aspects of a CRM's structure and of distinguishing between DNA sequences which do or do not contain it. We also present a learning algorithm tailored for this domain, and a novel method to avoid overfitting by controlling the expressivity of the model. RESULTS: We are able to find statistically significant CRMs more often than a current state-of-the-art approach on the same data sets. We also show experimentally that each aspect of our expressive CRM model space makes a positive contribution to the learned models on yeast and fly data. CONCLUSION: Structural aspects are an important part of CRMs, both in terms of interpreting them biologically and learning them accurately. Source code for our algorithm is available at

    Automated detection of regions of interest for tissue microarray experiments: an image texture analysis

    BACKGROUND: Recent research with tissue microarrays has led to rapid progress toward quantifying the expression of large sets of biomarkers in normal and diseased tissue. However, standard procedures for sampling tissue for molecular profiling have not yet been established. METHODS: This study presents a high throughput analysis of texture heterogeneity on breast tissue images for the purpose of identifying regions of interest in the tissue for molecular profiling via tissue microarray technology. Image texture of breast histology slides was described in terms of three parameters: the percentage of area occupied in an image block by chromatin (B), the percentage occupied by stroma-like regions (P), and a statistical heterogeneity index H commonly used in image analysis. Texture parameters were defined and computed for each of the thousands of image blocks in our dataset using both gray-scale and color segmentation. The image blocks were then classified into three categories using the texture feature parameters in a novel statistical learning algorithm: image blocks specific to normal breast tissue, blocks specific to cancerous tissue, and blocks that are non-specific to normal and disease states. RESULTS: Gray-scale and color segmentation techniques identified the same regions in histology slides as cancer-specific. Moreover, the image blocks identified as cancer-specific belonged to the cell-crowded regions in whole section image slides that were marked by two pathologists as regions of interest for further histological studies. CONCLUSION: These results indicate the high efficiency of our automated method for identifying pathologic regions of interest on histology slides. Automation of critical region identification will help minimize the inter-rater variability among different raters (pathologists), as the hundreds of tumors used to develop an array have typically been evaluated (graded) by different pathologists. The region of interest information gathered from the whole section images will guide the excision of tissue for constructing tissue microarrays and for high throughput profiling of global gene expression.
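    A minimal sketch of the block-level parameters, assuming a per-pixel segmentation label map is already available (the label codes and the use of Shannon entropy of the gray-level histogram as the heterogeneity index H are our assumptions, not the paper's definitions):

        import numpy as np

        def block_texture_features(labels, gray, chromatin_label=1, stroma_label=2, bins=32):
            """Compute (B, P, H) for one image block.

            labels : 2-D array of per-pixel segmentation labels (assumed given)
            gray   : 2-D array of gray-scale intensities scaled to [0, 1]
            """
            B = np.mean(labels == chromatin_label)  # fraction of the block covered by chromatin
            P = np.mean(labels == stroma_label)     # fraction covered by stroma-like regions
            hist, _ = np.histogram(gray, bins=bins, range=(0.0, 1.0))
            p = hist / max(hist.sum(), 1)
            H = -np.sum(p[p > 0] * np.log2(p[p > 0]))  # histogram entropy as a heterogeneity proxy
            return B, P, H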

    Gene selection for classification of microarray data based on the Bayes error

    <p>Abstract</p> <p>Background</p> <p>With DNA microarray data, selecting a compact subset of discriminative genes from thousands of genes is a critical step for accurate classification of phenotypes for, e.g., disease diagnosis. Several widely used gene selection methods often select top-ranked genes according to their individual discriminative power in classifying samples into distinct categories, without considering correlations among genes. A limitation of these gene selection methods is that they may result in gene sets with some redundancy and yield an unnecessary large number of candidate genes for classification analyses. Some latest studies show that incorporating gene to gene correlations into gene selection can remove redundant genes and improve classification accuracy.</p> <p>Results</p> <p>In this study, we propose a new method, Based Bayes error Filter (BBF), to select relevant genes and remove redundant genes in classification analyses of microarray data. The effectiveness and accuracy of this method is demonstrated through analyses of five publicly available microarray datasets. The results show that our gene selection method is capable of achieving better accuracies than previous studies, while being able to effectively select relevant genes, remove redundant genes and obtain efficient and small gene sets for sample classification purposes.</p> <p>Conclusion</p> <p>The proposed method can effectively identify a compact set of genes with high classification accuracy. This study also indicates that application of the Bayes error is a feasible and effective wayfor removing redundant genes in gene selection.</p

    Gene selection for cancer classification with the help of bees


    Computers in Urban Planning in the Developing Countries

    Invited conference closing paper